The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. The world is a rapidly changing place and happiness is something that everyone pursuits in the ever growing complicated world. The United Nations Sustainable Development Solutions Network has been publishing the World Happiness report since 2012. This first report presented the available global data on national happiness and reviewed related evidence from the emerging science of happiness, showing that the quality of people’s lives can be assessed by a variety of subjective well-being measures, collectively referred to then and in subsequent reports as “happiness.” Each report includes updated evaluations and a range of commissioned chapters on special topics digging deeper into the science of well-being, and on happiness in specific countries and regions. We will try to do an Exploratory Data Analysis of the 2019 World Happiness Report. It contains rankings of national happiness based on respondent ratings of their own lives, which the report also correlates with various life factors.
Below are some questions we will expect to answer from our Analysis:
Link to the dataset used for Analysis: https://www.kaggle.com/PromptCloudHQ/world-happiness-report-2019
The data did not require any prior pre-processing and was available in a fairly clean .csv format.
Understanding the ranks in the dataset
Positive Attributes (total 7)
There are 7 columns in the data set that indicate Positive Attributes: Happiness Rank / Ladder, Positive Affect, Social Support, Freedom, Generosity, Log of GDP, Healthy Life expectancy.
Lower number of these attributes indicate better score and greater impact on Happiness Score.
e.g. Singapore ranks # 1 in Healthy Life expectancy and the people think that it contributes a lot to their overall Happiness.
Negative Attributes (total 2)
There are 2 columns in the data set that indicate Negative Attributes: Negative Affect and Corruption.
Higher number of these attributes indicate worst score and greater negative impact on Happiness Score.
e.g. Moldova ranks # 148 in Corruption. This means that the people have indicated their Country has more corruption that affects their Happiness.
The R Libraries used for analysis are UsingR, plotly, ggplot2, corrplot, dplyr, tidyr, rworldmap, prob, sampling and RColorBrewer.
df.whr <- read.csv("world-happiness-report-2019.csv",
header = TRUE,
col.names = c("Country", "Happiness.Rank", "SD.of.Ladder",
"Positive.Affect", "Negative.Affect",
"Social.Support", "Freedom", "Corruption",
"Generosity", "Log.of.GDP", "Healthy.Life"))head() function:## Country Happiness.Rank SD.of.Ladder Positive.Affect Negative.Affect
## 1 Finland 1 4 41 10
## 2 Denmark 2 13 24 26
## 3 Norway 3 8 16 29
## 4 Iceland 4 9 3 3
## 5 Netherlands 5 1 12 25
## 6 Switzerland 6 11 44 21
## Social.Support Freedom Corruption Generosity Log.of.GDP Healthy.Life
## 1 2 5 4 47 22 27
## 2 4 6 3 22 14 23
## 3 3 3 8 11 7 12
## 4 1 7 45 3 15 13
## 5 15 19 12 7 12 18
## 6 13 11 7 16 8 4
str() function:## 'data.frame': 156 obs. of 11 variables:
## $ Country : Factor w/ 156 levels "Afghanistan",..: 44 37 106 58 100 134 133 101 24 7 ...
## $ Happiness.Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SD.of.Ladder : int 4 13 8 9 1 11 18 15 23 10 ...
## $ Positive.Affect: int 41 24 16 3 12 44 34 22 18 64 ...
## $ Negative.Affect: int 10 26 29 3 25 21 8 12 49 24 ...
## $ Social.Support : int 2 4 3 1 15 13 25 5 20 31 ...
## $ Freedom : int 5 6 3 7 19 11 10 8 9 26 ...
## $ Corruption : int 4 3 8 45 12 7 6 5 11 19 ...
## $ Generosity : int 47 22 11 3 7 16 17 8 14 25 ...
## $ Log.of.GDP : int 22 14 7 15 12 8 13 26 19 16 ...
## $ Healthy.Life : int 27 23 12 13 18 4 17 14 8 15 ...
# Add Continent column:
asia <- c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan",
"Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
"South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong", "Philippines",
"Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
"Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
"Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
"Cambodia", "Afghanistan", "Yemen", "Syria", "Laos")
europe <- c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
"Netherlands", "Sweden", "Austria", "Ireland", "Germany",
"Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
"Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
"Russia", "Lithuania", "Latvia", "Moldova", "Romania",
"Slovenia", "Northern Cyprus", "Cyprus", "Estonia", "Belarus",
"Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
"Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
"Bulgaria", "Albania", "Ukraine")
north.america <- c("Canada", "Costa Rica", "United States", "Mexico",
"Panama","Trinidad and Tobago", "El Salvador", "Guatemala",
"Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
"Haiti")
south.america <- c("Chile", "Brazil", "Argentina", "Uruguay",
"Colombia", "Ecuador", "Bolivia", "Peru",
"Paraguay", "Venezuela")
australia <- c("New Zealand", "Australia")
africa <- c("Libya", "Mauritius", "Nigeria", "Algeria", "Morocco", "Cameroon", "Ghana",
"Ivory Coast", "Benin", "Congo (Brazzaville)", "Gabon", "South Africa", "Senegal",
"Somalia", "Namibia", "Niger", "Burkina Faso", "Guinea", "Gambia", "Kenya", "Mauritania",
"Mozambique", "Tunisia", "Congo (Kinshasa)", "Mali", "Sierra Leone", "Chad", "Ethiopia",
"Swaziland", "Uganda", "Egypt", "Zambia", "Togo", "Liberia", "Comoros", "Madagascar",
"Lesotho", "Burundi", "Zimbabwe", "Botswana", "Malawi", "Rwanda", "Tanzania", "Central African Republic",
"South Sudan")
df.whr$Continent <- NA
df.whr[df.whr$Country %in% australia, ]$Continent <- "Australia"
df.whr[df.whr$Country %in% africa, ]$Continent <- "Africa"
df.whr[df.whr$Country %in% south.america, ]$Continent <- "South America"
df.whr[df.whr$Country %in% north.america, ]$Continent <- "North America"
df.whr[df.whr$Country %in% europe, ]$Continent <- "Europe"
df.whr[df.whr$Country %in% asia, ]$Continent <- "Asia"## Country Happiness.Rank
## 1 Finland 1
## 2 Denmark 2
## 3 Norway 3
## 4 Iceland 4
## 5 Netherlands 5
## 6 Switzerland 6
## 7 Sweden 7
## 8 New Zealand 8
## 9 Canada 9
## 10 Austria 10
## Country Happiness.Rank
## 147 Haiti 147
## 148 Botswana 148
## 149 Syria 149
## 150 Malawi 150
## 151 Yemen 151
## 152 Rwanda 152
## 153 Tanzania 153
## 154 Afghanistan 154
## 155 Central African Republic 155
## 156 South Sudan 156
We have categorized our analysis between All Countries (with no missing values), Top 10 and Bottom 10 ranking Countries and Continents / Regions.
We will first understand the attributes that contributes the most and least to the Happiness of a Country. To do that, we will try to plot a correlation matrix of all attributes.
From the above plot, we conclude below:
Social Support, Healthy Life expectancy and Log of GDP contribute the most to a better Happiness rank of the Country.
Corruption has the least correlation to Happiness rank of a Country. As the sense of Corruption in people gets higher, the Happiness of the Country will lower.
Log of GDP and Healthy Life expectancy are the most correlated attributes to each other.
We will prove the correlation analysis by plotting the pairs plot below:
We will now conduct our analysis on Top 10 happiest and Bottom 10 unhappy Countries. We will try to understand how different aspects of a Country impacts its Happiness score.
Numerical Variable: Positive Affect
Categorical Variable: Country
top10.positive.bar <- plot_ly(top10.df.whr, y = ~Country, x = ~Positive.Affect,
type = 'bar', name = 'Top 10') %>%
layout(title = "Barplot of Positive Emotion",
yaxis = list(title = 'Country', autorange = "reversed"),
xaxis = list(title = 'Positive Affect'))
bot10.positive.bar <- plot_ly(bottom10.df.whr, y = ~Country, x = ~Positive.Affect,
type = 'bar', name = 'Bottom 10') %>%
layout(title = "Positive Emotion",
yaxis = list(title = 'Country'),
xaxis = list(title = 'Positive Affect'))
subplot(
top10.positive.bar,
bot10.positive.bar,
nrows = 2,
shareX = TRUE
)We will now try to plot all attributes of the Top 10 and Bottom 10 countries.
top10.all <- plot_ly(top10.df.whr, x = ~Country, y = ~Positive.Affect,
type = 'bar', name = 'Positive Affect')
top10.all <- top10.all %>% add_trace(y = ~Negative.Affect, name = 'Negative Affect')
top10.all <- top10.all %>% add_trace(y = ~Social.Support, name = 'Social Support')
top10.all <- top10.all %>% add_trace(y = ~Freedom, name = 'Freedom')
top10.all <- top10.all %>% add_trace(y = ~Corruption, name = 'Corruption')
top10.all <- top10.all %>% add_trace(y = ~Generosity, name = 'Generosity')
top10.all <- top10.all %>% add_trace(y = ~Log.of.GDP, name = 'Log of GDP')
top10.all <- top10.all %>% add_trace(y = ~Healthy.Life, name = 'Healthy Life')
top10.all <- top10.all %>% layout(title = "Top 10: All Attributes",
xaxis = list(title = 'Country'),
yaxis = list(range=c(0,1200),
title = 'Count'),
barmode = 'stack')
top10.allbot10.all <- plot_ly(bottom10.df.whr, x = ~Country, y = ~Positive.Affect,
type = 'bar', name = 'Positive Affect')
bot10.all <- bot10.all %>% add_trace(y = ~Negative.Affect, name = 'Negative Affect')
bot10.all <- bot10.all %>% add_trace(y = ~Social.Support, name = 'Social Support')
bot10.all <- bot10.all %>% add_trace(y = ~Freedom, name = 'Freedom')
bot10.all <- bot10.all %>% add_trace(y = ~Corruption, name = 'Corruption')
bot10.all <- bot10.all %>% add_trace(y = ~Generosity, name = 'Generosity')
bot10.all <- bot10.all %>% add_trace(y = ~Log.of.GDP, name = 'Log of GDP')
bot10.all <- bot10.all %>% add_trace(y = ~Healthy.Life, name = 'Healthy Life')
bot10.all <- bot10.all %>% layout(title = "Bottom 10: All Attributes",
xaxis = list(title = 'Country'),
yaxis = list(range=c(0,1200), title = 'Count'),
barmode = 'stack')
bot10.alltop10.df.whr <- top10.df.whr %>% mutate(Level = "Top10")
bottom10.df.whr <- bottom10.df.whr %>% mutate(Level = "Bottom10")
comparison <- bind_rows(top10.df.whr, bottom10.df.whr)
comparison$Level <- as.factor(comparison$Level)
comparison <- transform(comparison, Level = factor(Level, levels = c("Top10", "Bottom10" )))
comparison.score <- comparison %>% gather(key = "columns", value = "score", Happiness.Rank:Healthy.Life)
comparison.score %>%
ggplot(aes(x = Level, y = score, colour = Level, fill = Level)) +
geom_boxplot(position=position_dodge(width=1)) + facet_wrap(~columns, scales = "free")We will now analyze how the Happiness score is impacted by Continents or Geographical presence of the Countries.
We have plot below the distribution of some key attributes that impact the Happiness score.
df.notNA.Africa <- subset(df.whr.notNA, df.whr.notNA$Continent == "Africa")
df.notNA.Asia <- subset(df.whr.notNA, df.whr.notNA$Continent == "Asia")
df.notNA.Aus <- subset(df.whr.notNA, df.whr.notNA$Continent == "Australia")
df.notNA.Europe <- subset(df.whr.notNA, df.whr.notNA$Continent == "Europe")
df.notNA.NorthAm <- subset(df.whr.notNA, df.whr.notNA$Continent == "North America")
df.notNA.SouthAm <- subset(df.whr.notNA, df.whr.notNA$Continent == "South America")Log of GDP
The Log of GDP distribution is having higher or worst scores in the African, Asian and South American continent.
Europe and Australian continent seem to have better scores for Log Of GDP compared to other continents.
The Log of GDP scores vary a lot in the North American countries.
par(mfrow=c(3,2))
# Africa
plot(df.notNA.Africa$Log.of.GDP, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "deepskyblue2", main = "Africa: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Log.of.GDP), df.notNA.Africa$Log.of.GDP, pch = 16)
abline(h = 0, col="red")
# Asia
plot(df.notNA.Asia$Log.of.GDP, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "deepskyblue2", main = "Asia: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Log.of.GDP), df.notNA.Asia$Log.of.GDP, pch = 16)
abline(h = 0, col="red")
# Europe
plot(df.notNA.Europe$Log.of.GDP, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "deepskyblue2", main = "Europe: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Log.of.GDP), df.notNA.Europe$Log.of.GDP, pch = 16)
abline(h = 0, col="red")
# Australia
plot(df.notNA.Aus$Log.of.GDP, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "deepskyblue2", main = "Australia: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Log.of.GDP), df.notNA.Aus$Log.of.GDP, pch = 16)
abline(h = 0, col="red")
# North America
plot(df.notNA.NorthAm$Log.of.GDP, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "deepskyblue2", main = "North America: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Log.of.GDP), df.notNA.NorthAm$Log.of.GDP, pch = 16)
abline(h = 0, col="red")
# South America
plot(df.notNA.SouthAm$Log.of.GDP, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "deepskyblue2", main = "South America: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Log.of.GDP), df.notNA.SouthAm$Log.of.GDP, pch = 16)
abline(h = 0, col="red")Healthy life expectancy
Healthy Life expectancy follows almost similar distribution to Log of GDP.
This indicates the correlation between the 2 attributes is very linear.
We see that again Africa seems to have the worst Healthy Life expectancy scores while Europe and Australian continents have the best scores.
par(mfrow=c(3,2))
# Africa
plot(df.notNA.Africa$Healthy.Life, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "forestgreen", main = "Africa: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Healthy.Life), df.notNA.Africa$Healthy.Life, pch = 16)
abline(h = 0, col="red")
# Asia
plot(df.notNA.Asia$Healthy.Life, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "forestgreen", main = "Asia: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Healthy.Life), df.notNA.Asia$Healthy.Life, pch = 16)
abline(h = 0, col="red")
# Europe
plot(df.notNA.Europe$Healthy.Life, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "forestgreen", main = "Europe: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Healthy.Life), df.notNA.Europe$Healthy.Life, pch = 16)
abline(h = 0, col="red")
# Australia
plot(df.notNA.Aus$Healthy.Life, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "forestgreen", main = "Australia: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Healthy.Life), df.notNA.Aus$Healthy.Life, pch = 16)
abline(h = 0, col="red")
# North America
plot(df.notNA.NorthAm$Healthy.Life, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "forestgreen", main = "North America: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Healthy.Life), df.notNA.NorthAm$Healthy.Life, pch = 16)
abline(h = 0, col="red")
# South America
plot(df.notNA.SouthAm$Healthy.Life, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "forestgreen", main = "South America: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Healthy.Life), df.notNA.SouthAm$Healthy.Life, pch = 16)
abline(h = 0, col="red")Corruption
The Corruption attribute seems to vary a lot in all the continents.
However, Europe and Australian continents have lot of low corruption scores.
par(mfrow=c(3,2))
# Africa
plot(df.notNA.Africa$Corruption, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "firebrick2", main = "Africa: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Corruption), df.notNA.Africa$Corruption, pch = 16)
abline(h = 0, col="red")
# Asia
plot(df.notNA.Asia$Corruption, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "firebrick2", main = "Asia: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Corruption), df.notNA.Asia$Corruption, pch = 16)
abline(h = 0, col="red")
# Europe
plot(df.notNA.Europe$Corruption, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "firebrick2", main = "Europe: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Corruption), df.notNA.Europe$Corruption, pch = 16)
abline(h = 0, col="red")
# Australia
plot(df.notNA.Aus$Corruption, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "firebrick2", main = "Australia: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Corruption), df.notNA.Aus$Corruption, pch = 16)
abline(h = 0, col="red")
# North America
plot(df.notNA.NorthAm$Corruption, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "firebrick2", main = "North America: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Corruption), df.notNA.NorthAm$Corruption, pch = 16)
abline(h = 0, col="red")
# South America
plot(df.notNA.SouthAm$Corruption, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "firebrick2", main = "South America: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Corruption), df.notNA.SouthAm$Corruption, pch = 16)
abline(h = 0, col="red")Negative Affect
The Negative Emotion varies a lot in Asia and Europe.
The other continents namely Africa and South America have some higher Negative emotion scores.
par(mfrow=c(3,2))
# Africa
plot(df.notNA.Africa$Negative.Affect, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "maroon4", main = "Africa: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Negative.Affect), df.notNA.Africa$Negative.Affect, pch = 16)
abline(h = 0, col="red")
# Asia
plot(df.notNA.Asia$Negative.Affect, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "maroon4", main = "Asia: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Negative.Affect), df.notNA.Asia$Negative.Affect, pch = 16)
abline(h = 0, col="red")
# Europe
plot(df.notNA.Europe$Negative.Affect, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "maroon4", main = "Europe: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Negative.Affect), df.notNA.Europe$Negative.Affect, pch = 16)
abline(h = 0, col="red")
# Australia
plot(df.notNA.Aus$Negative.Affect, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "maroon4", main = "Australia: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Negative.Affect), df.notNA.Aus$Negative.Affect, pch = 16)
abline(h = 0, col="red")
# North America
plot(df.notNA.NorthAm$Negative.Affect, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "maroon4", main = "North America: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Negative.Affect), df.notNA.NorthAm$Negative.Affect, pch = 16)
abline(h = 0, col="red")
# South America
plot(df.notNA.SouthAm$Negative.Affect, type = "h",
xlim = c(0, 46), ylim = c(0, 156),
xaxt = "n", yaxt = "n",
xlab = "x", ylab = "Rank",
col = "maroon4", main = "South America: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Negative.Affect), df.notNA.SouthAm$Negative.Affect, pch = 16)
abline(h = 0, col="red")bp.cont <- plot_ly(df.whr.notNA, x = ~df.whr.notNA[df.whr.notNA$Continent == "Asia" , 2], type="box", name = 'Asia')
bp.cont <- add_trace(bp.cont, x = ~df.whr.notNA[df.whr.notNA$Continent == "Africa" , 2], name = 'Africa')
bp.cont <- add_trace(bp.cont, x = ~df.whr.notNA[df.whr.notNA$Continent == "Australia" , 2], name = 'Australia')
bp.cont <- add_trace(bp.cont, x = ~df.whr.notNA[df.whr.notNA$Continent == "Europe" , 2], name = 'Europe')
bp.cont <- add_trace(bp.cont, x = ~df.whr.notNA[df.whr.notNA$Continent == "North America" , 2], name = 'North America')
bp.cont <- add_trace(bp.cont, x = ~df.whr.notNA[df.whr.notNA$Continent == "South America" , 2], name = 'South America')
bp.cont <- bp.cont %>% layout(title = "Happiness Rank by Continent",
yaxis = list(title = "Continent"),
xaxis = list(title = "Rank"))
bp.contddf <- subset(df.whr[c("Country","Happiness.Rank")])
spdf <- joinCountryData2Map(ddf, joinCode="NAME", nameJoinColumn="Country")## 154 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 89 codes from the map weren't represented in your data
We will use numerical variable: Happiness Rank
hist(df.whr.notNA$Happiness.Rank, prob = TRUE,
col = "skyblue",
xlab = "x",
ylab = "Density",
ylim = c(0, 0.01),
xlim = c(0,165),
xaxt = "n",
yaxt = "n",
breaks = 30,
main = paste("Happiness Rank Dist."))
axis(side = 1, at = seq(from = 0, to = 165, by = 5))
axis(side = 2, at = seq(from = 0, to = 0.01, by = 0.001))We will now draw 100 samples from the dataset (with no missing column values).
We will then take 10, 20, 30 and 40 mean value samples and plot them to demonstrate the Central Limit Theorem.
set.seed(6118)
par(mfrow = c(2,2))
samples <- 1000
xbar <- numeric(samples)
mean <- numeric(4)
sd <- numeric(4)
cntr <- 1
for (size in c(10, 20, 30, 40)) {
for (i in 1:samples) {
xbar[i] <- mean(sample(df.whr.notNA$Happiness.Rank, size, replace = TRUE))
}
hist(xbar, prob = TRUE,
col = "skyblue",
xlab = "x",
ylab = "Density",
ylim = c(0, 0.06),
xlim = c(0,150),
xaxt = "n",
yaxt = "n",
main = paste("Sample Size =", size))
axis(side = 1, at = seq(from = 0, to = 150, by = 5))
axis(side = 2, at = seq(from = 0, to = 1, by = 0.01))
mean[cntr] <- mean(xbar)
sd[cntr] <- sd(xbar)
cntr <- cntr + 1
}par(mfrow = c(1,1))
paste("Sample Size = ", c(10, 20, 30, 40), " Mean = ", round(mean, 2), " SD = ", round(sd, 2))## [1] "Sample Size = 10 Mean = 78.84 SD = 14.23"
## [2] "Sample Size = 20 Mean = 79.45 SD = 10.3"
## [3] "Sample Size = 30 Mean = 79.14 SD = 8.36"
## [4] "Sample Size = 40 Mean = 79.2 SD = 7.44"
With 1000 samples we can see that the different mean value samples form a normal distribution plot.
As the Sample size increases, we see that the mean of the sample means remain almost same. However, the Standard Deviation keeps getting smaller indicating that the Normal distribution plot keeps getting narrower.
In this sampling method we will pull 20 random samples from the dataset (with no missing column values).
We will perform the sampling without replacement.
set.seed(6118)
N <- nrow(df.whr.notNA)
n <- 20
samp <- srswor(n, N)
samples2 <- df.whr.notNA[samp != 0, ]
samples2## Country Happiness.Rank SD.of.Ladder Positive.Affect Negative.Affect
## 2 Denmark 2 13 24 26
## 3 Norway 3 8 16 29
## 10 Austria 10 10 64 24
## 32 Brazil 32 116 69 105
## 38 Slovakia 38 39 53 47
## 43 Colombia 43 120 30 88
## 56 Jamaica 56 102 51 51
## 67 Pakistan 67 53 130 111
## 71 Moldova 71 45 133 67
## 89 Morocco 89 101 110 91
## 97 Bulgaria 97 47 117 13
## 98 Ghana 98 129 92 72
## 107 Albania 107 126 90 108
## 115 Burkina Faso 115 92 115 117
## 126 Iraq 126 147 151 154
## 137 Egypt 137 66 146 124
## 140 India 140 41 93 115
## 142 Comoros 142 143 67 114
## 143 Madagascar 143 77 46 96
## 147 Haiti 147 111 142 119
## Social.Support Freedom Corruption Generosity Log.of.GDP Healthy.Life
## 2 4 6 3 22 14 23
## 3 3 3 8 11 7 12
## 10 31 26 19 25 16 15
## 32 43 84 71 108 70 72
## 38 21 108 142 70 35 38
## 43 52 56 124 111 74 51
## 56 28 49 130 119 93 55
## 67 130 114 55 58 110 114
## 71 65 128 148 86 109 86
## 89 139 76 84 154 98 79
## 97 18 115 147 112 56 65
## 98 132 91 117 52 114 121
## 107 133 87 134 60 81 40
## 115 116 127 47 125 137 136
## 126 124 130 66 73 64 107
## 137 118 129 89 132 85 101
## 140 142 41 73 65 103 105
## 142 143 148 81 62 143 117
## 143 128 146 116 136 144 111
## 147 146 152 48 20 138 125
## Continent
## 2 Europe
## 3 Europe
## 10 Europe
## 32 South America
## 38 Europe
## 43 South America
## 56 North America
## 67 Asia
## 71 Europe
## 89 Africa
## 97 Europe
## 98 Africa
## 107 Europe
## 115 Africa
## 126 Asia
## 137 Africa
## 140 Asia
## 142 Africa
## 143 Africa
## 147 North America
From the above plot, we conclude below:
Our original conclusions have remained intact.
Social Support, Healthy Life expectancy and Log of GDP are still the most contributing factors for a better Happiness Score.
We do see that Countries in the selected sample have higher Negative Affect correlation to Happiness Rank.
Log of GDP and Healthy Life expectancy are still the most correlated attributes in the selected sample.
In this sampling method, we will pull 20 random samples from the dataset (with no missing column values).
We will perform the sampling without replacement.
It so happens that we will take the 7th row from the dataset to form the sample dataset.
set.seed(6118)
N <- nrow(df.whr.notNA)
n <- 20
# items in each group
k <- ceiling(N / n)
# random item from first group
r <- sample(k, 1)
# select every kth item
s <- seq(r, by = k, length = n)
samples3 <- df.whr.notNA[s, ]
samples3## Country Happiness.Rank SD.of.Ladder Positive.Affect
## 6 Switzerland 6 11 44
## 13 Israel 13 14 104
## 20 Czech Republic 20 20 74
## 31 Panama 31 121 7
## 39 Trinidad and Tobago 39 89 14
## 47 Argentina 47 97 28
## 55 Estonia 55 32 50
## 62 Hungary 62 36 86
## 70 Serbia 70 100 148
## 79 Turkey 79 58 154
## 86 Kyrgyzstan 86 46 58
## 95 Bhutan 95 6 37
## 103 Congo (Brazzaville) 103 152 124
## 111 Senegal 111 44 68
## 119 Georgia 119 51 141
## 126 Iraq 126 147 151
## 133 Ukraine 133 69 131
## 141 Liberia 141 156 103
## 148 Botswana 148 125 87
## 156 South Sudan 156 140 127
## Negative.Affect Social.Support Freedom Corruption Generosity Log.of.GDP
## 6 21 13 11 7 16 8
## 13 69 38 93 74 24 31
## 20 22 24 58 121 117 32
## 31 48 41 32 104 88 51
## 39 52 29 51 141 41 38
## 47 93 46 54 109 123 55
## 55 6 12 45 30 83 37
## 62 31 51 138 140 100 42
## 70 92 57 124 118 84 71
## 79 121 61 140 50 98 44
## 86 4 45 38 138 36 120
## 95 98 68 59 25 13 95
## 103 136 138 92 60 140 111
## 111 60 106 121 88 130 126
## 119 43 147 104 28 153 87
## 126 154 124 130 66 73 64
## 133 44 56 141 143 66 94
## 141 146 127 94 126 110 150
## 148 65 105 60 54 150 66
## 156 152 148 154 61 85 140
## Healthy.Life Continent
## 6 4 Europe
## 13 11 Asia
## 20 31 Europe
## 31 33 North America
## 39 93 North America
## 47 37 South America
## 55 41 Europe
## 62 56 Europe
## 70 48 Europe
## 79 69 Asia
## 86 91 Asia
## 95 104 Asia
## 103 116 Africa
## 111 109 Africa
## 119 84 Asia
## 126 107 Asia
## 133 87 Europe
## 141 126 Africa
## 148 113 Africa
## 156 143 Africa
From the above plot, we conclude below:
Our original conclusions have remained intact.
Social Support, Healthy Life expectancy and Log of GDP are still the most contributing factors for a better Happiness Score.
We do see that Countries in the selected sample have lower Corruption correlation that indicate that there are some of the high Corruption ranking countries in the selected sample.
Log of GDP and Healthy Life expectancy are still the most correlated attributes in the selected sample.
In the stratified sampling process, we will Order the dataset (with no missing column values) by the Continent name.
We will then pull 40 proportional samples based on the number of Countries belonging to the respective Continents and create our sample dataset.
We will perform the sampling with replacement.
n <- 40
# Ordering of data based on Continent:
order.index <- order(df.whr.notNA$Continent)
ordered.countries <- df.whr.notNA[order.index, ]
freq4 <- table(ordered.countries$Continent)
st.sizes <- n * freq4 / sum(freq4)
st.sizes <- round(st.sizes)
st <- sampling::strata(ordered.countries, stratanames = c("Continent"),
size = st.sizes, method = "srswr",
description = TRUE)## Stratum 1
##
## Population total and number of selected units: 43 12
## Stratum 2
##
## Population total and number of selected units: 32 9
## Stratum 3
##
## Population total and number of selected units: 2 1
## Stratum 4
##
## Population total and number of selected units: 39 11
## Stratum 5
##
## Population total and number of selected units: 13 4
## Stratum 6
##
## Population total and number of selected units: 10 3
## Number of strata 6
## Total number of selected units 40
## Country Happiness.Rank SD.of.Ladder Positive.Affect
## 57 Mauritius 57 94 55
## 57.1 Mauritius 57 94 55
## 122 Mauritania 122 68 94
## 124 Tunisia 124 79 147
## 127 Congo (Kinshasa) 127 78 125
## 129 Sierra Leone 129 153 139
## 136 Uganda 136 148 91
## 139 Togo 139 103 123
## 141 Liberia 141 156 103
## 142 Comoros 142 143 67
## 145 Burundi 145 138 98
## 153 Tanzania 153 122 78
## 34 Singapore 34 5 38
## 41 Uzbekistan 41 99 19
## 52 Thailand 52 81 20
## 54 South Korea 54 57 101
## 74 Tajikistan 74 50 120
## 79 Turkey 79 58 154
## 80 Malaysia 80 12 25
## 95 Bhutan 95 6 37
## 151 Yemen 151 85 153
## 8 New Zealand 8 15 22
## 7 Sweden 7 18 34
## 15 United Kingdom 15 16 52
## 18 Belgium 18 7 57
## 42 Lithuania 42 55 138
## 42.1 Lithuania 42 55 138
## 44 Slovenia 44 54 114
## 48 Romania 48 75 80
## 62 Hungary 62 36 86
## 66 Portugal 66 73 97
## 81 Belarus 81 22 149
## 97 Bulgaria 97 47 117
## 9 Canada 9 23 18
## 12 Costa Rica 12 62 4
## 77 Dominican Republic 77 155 66
## 147 Haiti 147 111 142
## 32 Brazil 32 116 69
## 33 Uruguay 33 88 10
## 47 Argentina 47 97 28
## Negative.Affect Social.Support Freedom Corruption Generosity Log.of.GDP
## 57 16 54 40 96 37 53
## 57.1 16 54 40 96 37 53
## 122 58 99 151 67 148 117
## 124 132 121 143 101 144 84
## 127 95 107 125 106 127 149
## 129 149 135 116 112 79 145
## 136 139 114 99 95 74 136
## 139 147 149 120 72 131 142
## 141 146 127 94 126 110 150
## 142 114 143 148 81 62 143
## 145 126 152 135 23 149 151
## 153 50 131 78 34 49 125
## 34 2 36 20 1 21 3
## 41 15 11 1 18 29 104
## 52 35 53 18 131 10 62
## 54 45 91 144 100 40 27
## 74 54 113 86 35 72 123
## 79 121 61 140 50 98 44
## 80 23 97 36 137 27 40
## 95 98 68 59 25 13 95
## 151 75 100 147 83 155 141
## 8 12 5 8 5 8 26
## 7 8 25 10 6 17 13
## 15 42 9 63 15 4 23
## 18 53 22 53 20 44 21
## 42 41 17 122 113 124 36
## 42.1 41 17 122 113 124 36
## 44 71 14 13 97 54 34
## 48 62 86 57 146 102 48
## 62 31 51 138 140 100 42
## 66 100 47 37 135 122 39
## 81 36 33 131 37 103 58
## 97 13 18 115 147 112 56
## 9 49 20 9 11 14 19
## 12 87 42 16 58 75 67
## 77 77 55 43 52 93 69
## 147 119 146 152 48 20 138
## 32 105 43 84 71 108 70
## 33 76 35 30 33 80 52
## 47 93 46 54 109 123 55
## Healthy.Life Continent ID_unit Prob Stratum
## 57 73 Africa 1 0.2460023 1
## 57.1 73 Africa 1 0.2460023 1
## 122 120 Africa 20 0.2460023 1
## 124 67 Africa 22 0.2460023 1
## 127 140 Africa 23 0.2460023 1
## 129 146 Africa 25 0.2460023 1
## 136 127 Africa 28 0.2460023 1
## 139 132 Africa 31 0.2460023 1
## 141 126 Africa 32 0.2460023 1
## 142 117 Africa 33 0.2460023 1
## 145 135 Africa 36 0.2460023 1
## 153 118 Africa 41 0.2460023 1
## 34 1 Asia 45 0.2485407 2
## 41 83 Asia 46 0.2485407 2
## 52 58 Asia 47 0.2485407 2
## 54 9 Asia 48 0.2485407 2
## 74 92 Asia 53 0.2485407 2
## 79 69 Asia 54 0.2485407 2
## 80 59 Asia 55 0.2485407 2
## 95 104 Asia 62 0.2485407 2
## 151 124 Asia 74 0.2485407 2
## 8 14 Australia 76 0.5000000 3
## 7 17 Europe 84 0.2485348 4
## 15 24 Europe 87 0.2485348 4
## 18 26 Europe 90 0.2485348 4
## 42 62 Europe 98 0.2485348 4
## 42.1 62 Europe 98 0.2485348 4
## 44 29 Europe 99 0.2485348 4
## 48 61 Europe 100 0.2485348 4
## 62 56 Europe 104 0.2485348 4
## 66 22 Europe 105 0.2485348 4
## 81 76 Europe 111 0.2485348 4
## 97 65 Europe 114 0.2485348 4
## 9 8 North America 117 0.2739750 5
## 12 28 North America 118 0.2739750 5
## 77 80 North America 128 0.2739750 5
## 147 125 North America 129 0.2739750 5
## 32 72 South America 131 0.2710000 6
## 33 35 South America 132 0.2710000 6
## 47 37 South America 134 0.2710000 6
From the above plot, we conclude below:
Our original conclusions have remained intact.
Social Support, Healthy Life expectancy and Log of GDP are still the most contributing factors for a better Happiness Score.
Log of GDP and Healthy Life expectancy are still the most correlated attributes in the selected sample.